Sweet Lift Taxi company has collected data on taxi orders at airports. Their aim is to predict the amount of taxi orders for the next hour, in order to allocate more drivers for peak hours. We will build a model with an RMSE lower than 48.
# !pip install --user plotly_express
# Import libraries
import pandas as pd
import numpy as np
import plotly_express as px
from matplotlib import pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, VotingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse, mean_absolute_error as mae, make_scorer
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import lightgbm as lgb
from catboost import CatBoostRegressor, Pool
# read dataframe
df = pd.read_csv('datasets/taxi.csv', parse_dates=['datetime'], index_col=['datetime'])
# sorting index
df.sort_index(inplace=True)
# checking if index is monotonic
print(df.index.is_monotonic_increasing)
True
# look at dataframe
df.head()
| datetime | num_orders |
|---|---|
| 2018-03-01 00:00:00 | 9 |
| 2018-03-01 00:10:00 | 14 |
| 2018-03-01 00:20:00 | 28 |
| 2018-03-01 00:30:00 | 20 |
| 2018-03-01 00:40:00 | 32 |
# info on num orders columns
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26496 entries, 2018-03-01 00:00:00 to 2018-08-31 23:50:00
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   num_orders  26496 non-null  int64
dtypes: int64(1)
memory usage: 414.0 KB
# confirming no missing values
df.isna().sum()
num_orders    0
dtype: int64
# resample data by the hour
df = df.resample('1H').sum()
We loaded the data, parsing the dates as datetimes and setting the datetime column as the index. We sorted the index and confirmed it is monotonic, then checked that the data is free of missing values. Finally, we resampled the data into hourly totals.
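As a quick illustration of what the hourly resample does, here is a minimal sketch on synthetic 10-minute data (hypothetical values, not the taxi dataset): each hour's total is the sum of its six 10-minute buckets.

```python
import pandas as pd

# Synthetic 10-minute series mirroring the structure of the taxi data
idx = pd.date_range('2018-03-01', periods=12, freq='10min')
raw = pd.DataFrame({'num_orders': [9, 14, 28, 20, 32, 21, 15, 10, 12, 8, 7, 5]},
                   index=idx)

# Resampling to hourly totals sums the six 10-minute buckets in each hour
hourly = raw.resample('1h').sum()
print(hourly)
```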
# summary statistics on the number of orders
df.describe()
| | num_orders |
|---|---|
| count | 4416.000000 |
| mean | 84.422781 |
| std | 45.023853 |
| min | 0.000000 |
| 25% | 54.000000 |
| 50% | 78.000000 |
| 75% | 107.000000 |
| max | 462.000000 |
# hourly number of orders
fig = px.line(df.num_orders, title='Total Hourly Number of Orders', template='ggplot2', height=600, labels={'value': 'Number of Orders'})
fig.update_xaxes(rangeslider_visible=True,
rangeselector=dict(
buttons=list([
dict(count=1, label='1m', step='month', stepmode='backward'),
dict(count=6, label='6m', step='month', stepmode='backward')
])
)
)
fig.show()
This is a visual of the time series resampled by the hour. The y-axis shows the number of taxi orders.
# distribution of orders
px.box(df.num_orders, title='Distribution of Orders', template='ggplot2', labels={'variable': 'Orders', 'value': 'count'}, height=600)
Here, we have the distribution of the number of orders. The median is 78 orders and the mean is about 84. We see some outliers with values above roughly 186.
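The outlier cutoff near 186 is consistent with the standard Tukey fence used by box plots, Q3 + 1.5 * IQR. A quick check using the quartiles from `df.describe()` above:

```python
# Upper Tukey fence from the summary statistics of hourly num_orders
q1, q3 = 54.0, 107.0        # 25th and 75th percentiles from df.describe()
iqr = q3 - q1               # interquartile range
upper_fence = q3 + 1.5 * iqr
print(upper_fence)          # points above this value are drawn as outliers
```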
# Average daily orders
numbers = [6, 12, 24, 168, 720]
for i in numbers:
px.line(df.num_orders.rolling(i).mean(), title=f'Mean Number of Orders per {i} Hours', template='ggplot2', labels={'value': 'Number of Orders', 'datetime':'Dates'}, height=600).show()
These visuals show the rolling mean number of orders over 6-hour, 12-hour, daily (24-hour), weekly (168-hour), and monthly (720-hour) windows. These visualizations allow us to clearly see the trend of orders increasing gradually from April to August.
# decomposed dataset
decomposed = seasonal_decompose(df)
# decomposed trend
px.line(decomposed.trend, title='Trend', template='ggplot2')
# decomposed seasonality
px.line(decomposed.seasonal, title='Seasonality', template='ggplot2')
# decomposed residual
px.line(decomposed.resid, title='Residual', template='ggplot2')
These visualizations show the decomposed components of the data. The trend matches the daily rolling-mean chart. The seasonal component looks tight at this scale because the 24-hour cycle repeats many times across the six-month window. The residuals generally fluctuate around 0 but start to show larger outliers in August.
# difference in number of orders
fig = px.line(df.num_orders-df.num_orders.shift(), title='Difference in Number of Orders', template='ggplot2', height=700)
fig.update_xaxes(rangeslider_visible=True,
rangeselector=dict(
buttons=list([
dict(count=1, label='1m', step='month', stepmode='backward'),
dict(count=6, label='6m', step='month', stepmode='backward')
])
)
)
fig.show()
Looking at the differenced series (each hour minus the previous hour), the magnitude of the changes grows over time, with the swings most pronounced in August.
# Making features, max lag 24 and rolling mean 24
def make_features(data, max_lag, rolling_mean_size):
data['year'] = data.index.year
data['month'] = data.index.month
data['day'] = data.index.day
data['dayofweek'] = data.index.dayofweek
data['hour'] = data.index.hour
for lag in range(1, max_lag + 1):
data['lag_{}'.format(lag)] = data['num_orders'].shift(lag)
data['rolling_mean'] = (
data['num_orders'].shift().rolling(rolling_mean_size).mean()
)
make_features(df, 24, 24)
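A toy illustration of the lag and rolling-mean features built above, using `max_lag=2` and a window of 3 instead of 24 for brevity (hypothetical values). Note that calling `shift()` before `rolling()` keeps the current hour out of its own feature, which prevents target leakage.

```python
import pandas as pd

# Small series standing in for hourly num_orders
s = pd.Series([10, 20, 30, 40, 50],
              index=pd.date_range('2018-03-01', periods=5, freq='h'))
toy = pd.DataFrame({'num_orders': s})

# Lag features: the value 1 and 2 hours ago
for lag in range(1, 3):
    toy[f'lag_{lag}'] = toy['num_orders'].shift(lag)

# Rolling mean of the *previous* 3 hours (shift() excludes the current hour)
toy['rolling_mean'] = toy['num_orders'].shift().rolling(3).mean()
print(toy)
```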
# splitting dataset to train, valid, and test
train, test = train_test_split(df, shuffle=False, test_size=0.1)  # shuffle=False keeps chronological order
train, valid = train_test_split(train, shuffle=False, test_size=0.11)
print('Train Dataset = ', ' Start : ', train.index.min(), ' End : ', train.index.max(), ' Difference : ', abs(train.index.min() - train.index.max()))
print('Valid Dataset = ', ' Start : ', valid.index.min(), ' End : ', valid.index.max(), ' Difference : ', abs(valid.index.min() - valid.index.max()))
print('Test Dataset = ', ' Start : ', test.index.min(), ' End : ', test.index.max(), ' Difference : ', abs(test.index.min() - test.index.max()))
Train Dataset =   Start :  2018-03-01 00:00:00   End :  2018-07-26 07:00:00   Difference :  147 days 07:00:00
Valid Dataset =   Start :  2018-07-26 08:00:00   End :  2018-08-13 13:00:00   Difference :  18 days 05:00:00
Test Dataset  =   Start :  2018-08-13 14:00:00   End :  2018-08-31 23:00:00   Difference :  18 days 09:00:00
# Dropping missing values from datasets
train = train.dropna()
valid = valid.dropna()
test = test.dropna()
# Splitting target and features
X_train = train.drop(columns='num_orders')
y_train = train.num_orders
X_valid = valid.drop(columns='num_orders')
y_valid = valid.num_orders
X_test = test.drop(columns='num_orders')
y_test = test.num_orders
features = df.drop(columns='num_orders')
target = df.num_orders
# visual of train test split
fig, ax = plt.subplots(figsize=(25,15))
train.num_orders.plot(ax=ax, label='Training Set', title='Data Train/Test Split')
valid.num_orders.plot(ax=ax, label='Valid Set', color='green')
test.num_orders.plot(ax=ax, label='Test Set')
ax.axvline('2018-07-26 08:00:00', color='black', ls='--')
ax.axvline('2018-08-13 14:00:00', color='black', ls='--')
ax.legend(['Training Set', 'Valid Set', 'Test Set'])
plt.show()
We define a function to make features, with a max lag of 24 and a rolling-mean window of 24. We then split the data into three parts: train, valid, and test. We train the models on the training data, tune them on the validation set, and reserve the test set for evaluating the final model. Since it is crucial to have adequate training data, we limited the validation and test sets to roughly 10% of the data each, leaving about 80% for training. Furthermore, with time series we cannot randomly select points to split on, so shuffle was set to False. This preserves the chronological order of the sets, as the last figure shows.
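The chronological splits above could also be produced with scikit-learn's `TimeSeriesSplit`, which is handy when cross-validating time-series models. A minimal sketch with illustrative sizes (not the project's actual split):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# TimeSeriesSplit yields expanding training windows that always
# precede the test window -- the same "no shuffling" principle
X = np.arange(20).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=4, test_size=4)

splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    print(f'train up to {train_idx.max()}, test {test_idx.min()}..{test_idx.max()}')
```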
# linear regression
lr = LinearRegression() # initialize model constructor
lr.fit(X_train, y_train) # train model on training set
predictions_valid_lr = lr.predict(X_valid) # get model predictions on validation set
result = mse(y_valid, predictions_valid_lr) ** 0.5 # calculate RMSE on validation set
print("RMSE of the linear regression model on the validation set:", result)
RMSE of the linear regression model on the validation set: 34.282142440117
# Get the feature importances of the best model
lr_importances = lr.coef_
# Create a dataframe with the feature importances and the corresponding feature names
lr_importances_df = pd.DataFrame({'feature':X_train.columns, 'coefficients':lr.coef_})
# Sort the dataframe by importance
lr_importances_df.sort_values(by='coefficients', ascending=False, inplace=True)
lr_importances_df
| | feature | coefficients |
|---|---|---|
| 29 | rolling_mean | 3.443462e+12 |
| 1 | month | 3.741096e+00 |
| 4 | hour | 6.812351e-01 |
| 2 | day | 1.919355e-01 |
| 0 | year | -1.203887e-15 |
| 3 | dayofweek | -2.726262e-01 |
| 28 | lag_24 | -1.434776e+11 |
| 5 | lag_1 | -1.434776e+11 |
| 12 | lag_8 | -1.434776e+11 |
| 6 | lag_2 | -1.434776e+11 |
| 27 | lag_23 | -1.434776e+11 |
| 11 | lag_7 | -1.434776e+11 |
| 15 | lag_11 | -1.434776e+11 |
| 20 | lag_16 | -1.434776e+11 |
| 17 | lag_13 | -1.434776e+11 |
| 26 | lag_22 | -1.434776e+11 |
| 9 | lag_5 | -1.434776e+11 |
| 16 | lag_12 | -1.434776e+11 |
| 21 | lag_17 | -1.434776e+11 |
| 7 | lag_3 | -1.434776e+11 |
| 8 | lag_4 | -1.434776e+11 |
| 14 | lag_10 | -1.434776e+11 |
| 10 | lag_6 | -1.434776e+11 |
| 25 | lag_21 | -1.434776e+11 |
| 18 | lag_14 | -1.434776e+11 |
| 19 | lag_15 | -1.434776e+11 |
| 22 | lag_18 | -1.434776e+11 |
| 23 | lag_19 | -1.434776e+11 |
| 24 | lag_20 | -1.434776e+11 |
| 13 | lag_9 | -1.434776e+11 |
The rolling mean has the largest coefficient among the linear regression features. The enormous magnitudes (on the order of 1e11 to 1e12) signal heavy multicollinearity among the lag features, so individual coefficients should not be over-interpreted. We achieve an RMSE score of 34.28 with the validation set.
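A small sketch (synthetic data, not the taxi features) of why near-duplicate regressors blow up OLS coefficients, and how a mild L2 penalty such as Ridge keeps them at an interpretable scale:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly identical columns, mimicking highly correlated lag features
rng = np.random.default_rng(19)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x + rng.normal(0, 1e-6, size=(200, 1))])
y = x.ravel() + rng.normal(0, 0.1, 200)

ols = LinearRegression().fit(X, y)    # coefficients explode in magnitude
ridge = Ridge(alpha=1.0).fit(X, y)    # penalty shares the weight sensibly
print('OLS coefficients:  ', ols.coef_)
print('Ridge coefficients:', ridge.coef_)
```

This suggests the lag coefficients above cancel each other in huge positive/negative pairs rather than reflecting genuine feature importance.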
# Decision Tree
best_model = None
best_result = float('inf')
best_depth = 0
for depth in range(1, 6): # choose hyperparameter range
dtr = DecisionTreeRegressor(random_state=19, max_depth=depth)
dtr.fit(X_train, y_train) # train model on training set
predictions_valid_dtr = dtr.predict(X_valid) # get model predictions on validation set
result = mse(y_valid, predictions_valid_dtr) ** 0.5
if result < best_result:
best_model = dtr
best_result = result
best_depth = depth
print(f"RMSE of the best model on the validation set (max_depth = {best_depth}): {best_result}")
RMSE of the best model on the validation set (max_depth = 5): 38.54337041017864
# Get the feature importances of the best model
dtr_importances = best_model.feature_importances_
# Create a dataframe with the feature importances and the corresponding feature names
dtr_importances_df = pd.DataFrame({'feature': X_train.columns, 'importance': best_model.feature_importances_})
# Sort the dataframe by importance
dtr_importances_df.sort_values(by='importance', ascending=False, inplace=True)
# top 10 feature importances
px.pie(dtr_importances_df.head(10), names='feature', values='importance', title='Top 10 Feature Importance for Decision Tree Regression', template='ggplot2', hole=0.2)
The lag 24 feature is the most important in the decision tree regression. We achieve an RMSE score of 38.54 with the validation set.
# Random Forest
best_model = None
best_result = float('inf')
best_est = 0
best_depth = 0
for est in range(600, 601):  # single-value ranges: narrowed to the best hyperparameters found while tuning
    for depth in range(100, 101):
rf = RandomForestRegressor(random_state=19, n_estimators=est, max_depth=depth)
rf.fit(X_train, y_train) # train model on training set
predictions_valid = rf.predict(X_valid) # get model predictions on validation set
result = mse(y_valid, predictions_valid) ** 0.5 # calculate RMSE on validation set
if result < best_result:
best_model = rf
best_result = result
best_est = est
best_depth = depth
print("RMSE of the best model on the validation set:", best_result, "n_estimators:", best_est, "best_depth:", best_depth)
RMSE of the best model on the validation set: 32.013650992753774 n_estimators: 600 best_depth: 100
# Get the feature importances of the best model
rf_importances = best_model.feature_importances_
# Create a dataframe with the feature importances and the corresponding feature names
rf_importances_df = pd.DataFrame({'feature': X_train.columns, 'importance': best_model.feature_importances_})
# Sort the dataframe by importance
rf_importances_df.sort_values(by='importance', ascending=False, inplace=True)
# top 10 feature importances
px.pie(rf_importances_df.head(10), names='feature', values='importance', title='Top 10 Feature Importance for Random Forest Regression', template='ggplot2', hole=0.2)
The lag 24 feature has the greatest importance in the random forest model. The RMSE score is 32.01.
# ADA Boost
regr = AdaBoostRegressor(random_state=19, n_estimators=100)
regr.fit(X_train, y_train)
predictions_valid_regr = regr.predict(X_valid) # get model predictions on validation set
result = mse(y_valid, predictions_valid_regr) ** 0.5 # calculate RMSE on validation set
print("RMSE of the ada boost regression model on the validation set:", result)
RMSE of the ada boost regression model on the validation set: 34.89293536381952
# table of feature importance
regr_imp = [t for t in zip(features, regr.feature_importances_)]
regr_imp_df = pd.DataFrame(regr_imp, columns=['feature', 'varimp'])
regr_imp_df = regr_imp_df.sort_values('varimp', ascending=False)
# top 10 feature importances
px.pie(regr_imp_df.head(10), names='feature', values='varimp', title='Top 10 Feature Importance for Ada Boost Regression', hole=.2, template='ggplot2')
The lag 24 feature is the most important, followed by lag 1, in the AdaBoost model. The RMSE score is 34.89.
# Gradient Boost
gbr = GradientBoostingRegressor(random_state=19, learning_rate=0.2, n_estimators=1000, verbose=100, max_depth=3)
gbr.fit(X_train, y_train)
predictions_valid_gbr = gbr.predict(X_valid)
result = mse(y_valid, predictions_valid_gbr) ** 0.5 # calculate RMSE on validation set
print("RMSE of the gradient boosting model on the validation set:", result)
Iter Train Loss Remaining Time
1 1055.7262 56.79s
2 902.6644 56.73s
...
999 23.2403 0.03s
1000 23.1917 0.00s
RMSE of the gradient boosting model on the validation set: 35.11544177065954
# table of feature importance
gbr_imp = [t for t in zip(features, gbr.feature_importances_)]
gbr_imp_df = pd.DataFrame(gbr_imp, columns=['feature', 'varimp'])
gbr_imp_df = gbr_imp_df.sort_values('varimp', ascending=False)
# top 10 feature importances
px.pie(gbr_imp_df.head(10), names='feature', values='varimp', title='Top 10 Feature Importance for Gradient Boosting', hole=.2, template='ggplot2')
For the gradient boosting model, the lag 24 feature is the most important, followed by lag 1. The validation RMSE is 35.12.
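The lag features referenced here are shifted copies of the target series. As a minimal, self-contained sketch (the `make_features` helper and the toy frame below are illustrative, not the notebook's exact code), lag and rolling-mean features for an hourly series can be built like this:

```python
import pandas as pd

def make_features(df, max_lag, rolling_size):
    """Add calendar, lag, and rolling-mean features to an hourly series."""
    df = df.copy()
    df['hour'] = df.index.hour
    df['dayofweek'] = df.index.dayofweek
    for lag in range(1, max_lag + 1):
        df[f'lag_{lag}'] = df['num_orders'].shift(lag)
    # shift(1) so the rolling mean uses only past values (no target leakage)
    df['rolling_mean'] = df['num_orders'].shift(1).rolling(rolling_size).mean()
    return df

# toy example: 30 hourly observations with num_orders = 0, 1, ..., 29
idx = pd.date_range('2018-03-01', periods=30, freq='h')
toy = pd.DataFrame({'num_orders': range(30)}, index=idx)
feats = make_features(toy, max_lag=24, rolling_size=3)
print(feats[['lag_1', 'lag_24', 'rolling_mean']].tail(1))
```

In this framing, lag 24 being the top feature simply means that demand 24 hours earlier is the strongest single predictor of current demand, which matches the daily seasonality seen in the decomposition.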
# XGB
xgbr = XGBRegressor(learning_rate=0.09, n_estimators=800, eval_metric='rmse', random_state=19, max_depth=6, early_stopping_rounds=500)
xgbr.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_valid, y_valid)], verbose=20)
# Make predictions on the validation set
predictions_xgbr = xgbr.predict(X_valid)
result = mse(y_valid, predictions_xgbr) ** 0.5 # calculate RMSE on validation set
print()
print("RMSE of the XGBoost model on the validation set:", result)
(verbose XGBoost eval log truncated: validation RMSE improved from 110.76 at round 0 to roughly 31.7 by round 160, then plateaued until early stopping)
RMSE of the XGBoost model on the validation set: 31.691194462756986
# table of feature importance
xgbr_imp = [t for t in zip(features, xgbr.feature_importances_)]
xgbr_imp_df = pd.DataFrame(xgbr_imp, columns=['feature', 'varimp'])
xgbr_imp_df = xgbr_imp_df.sort_values('varimp', ascending=False)
# top 10 feature importances
px.pie(xgbr_imp_df.head(10), names='feature', values='varimp', title='Top 10 Feature Importance for XG Boost', hole=.2, template='ggplot2')
For the XG Boost model, the lag 24 feature is the most important, followed by lag 1. The validation RMSE is 31.69.
# LGBM
# Create a LightGBM dataset
lgb_train = lgb.Dataset(X_train, y_train)
lgb_valid = lgb.Dataset(X_valid, y_valid, reference=lgb_train)
# Define the parameters for the LightGBM model
params = {
'objective': 'regression',
'metric': 'root_mean_squared_error',
'boosting_type': 'gbdt',
'random_state': 19
}
# Train the LightGBM model
lgbm = lgb.train(params, lgb_train, valid_sets=[lgb_valid], num_boost_round=500, callbacks=[lgb.early_stopping(stopping_rounds=50)]) # early_stopping_rounds argument is deprecated; use the callback
# Make predictions on the validation set
predictions_valid_lgbm = lgbm.predict(X_valid)
result = mse(y_valid, predictions_valid_lgbm) ** 0.5 # calculate RMSE on validation set
print()
print("RMSE of the lgbm model on the validation set:", result)
(verbose LightGBM training log truncated: validation RMSE fell from 55.54 at iteration 1; early stopping triggered, best iteration [82] with valid_0's rmse: 31.5106)
RMSE of the lgbm model on the validation set: 31.5105670958812
# Get the feature importances of the trained model
lgbm_importances = lgbm.feature_importance()
# Create a dataframe with the feature importances and the corresponding feature names
lgbm_importances_df = pd.DataFrame({'feature':X_train.columns, 'importance':lgbm_importances})
# Sort the dataframe by importance
lgbm_importances_df.sort_values(by='importance', ascending=False, inplace=True)
# top 10 feature importances
px.pie(lgbm_importances_df.head(10), names='feature', values='importance', title='Top 10 Feature Importance for Light GBM Regression', hole=.2, template='ggplot2')
For the LightGBM model, the lag 24 feature is the most important, followed by lags 1, 3, and 17. The validation RMSE is 31.51.
# catboost
catb = CatBoostRegressor(task_type='GPU', loss_function='RMSE', eval_metric='RMSE', iterations=2000, random_seed=19, early_stopping_rounds=500)
catb.fit(X_train, y_train, eval_set=(X_valid, y_valid), verbose=100, use_best_model=True, plot=True)
# Make predictions on the validation set
predictions_valid_catb = catb.predict(X_valid)
result = mse(y_valid, predictions_valid_catb) ** 0.5 # calculate RMSE on validation set
print()
print("RMSE of the CatBoost model on the validation set:", result)
catb.best_score_
# Get the feature importances of the trained model
catb_importances = catb.feature_importances_
# Create a dataframe with the feature importances and the corresponding feature names
catb_importances_df = pd.DataFrame({'feature':X_train.columns, 'importance':catb_importances})
# Sort the dataframe by importance
catb_importances_df.sort_values(by='importance', ascending=False, inplace=True)
# top 10 feature importances
px.pie(catb_importances_df.head(10), names='feature', values='importance', title='Top 10 Feature Importance for Catboost Regression', hole=.2, template='ggplot2')
For the CatBoost model, the lag 24 feature is the most important, followed by lag 1. The validation RMSE is 30.42.
# Training regressors
# reg1 = RandomForestRegressor(random_state=12345, n_estimators=90, max_depth=100)
# reg2 = GradientBoostingRegressor(random_state=19, learning_rate=0.2, n_estimators=1000, verbose=1, max_depth=3)
reg3 = XGBRegressor(learning_rate=0.09, n_estimators=800, eval_metric='rmse', random_state=19, max_depth=6) #, early_stopping_rounds=500)
reg4 = lgb.LGBMRegressor(objective='regression', metric='root_mean_squared_error', boosting_type='gbdt', random_state=19) #, early_stopping_rounds=50)
reg5 = CatBoostRegressor(task_type='GPU', loss_function='RMSE', eval_metric='RMSE', iterations=2000, random_seed=19) #, early_stopping_rounds=500)
ereg = VotingRegressor(estimators=[#('rf', reg1),
#('gbr', reg2),
('xgb', reg3),
('lgb', reg4),
('cat', reg5)],
verbose=1)
ereg = ereg.fit(X_train, y_train)
# Make predictions on the test set
predictions_valid_ereg = ereg.predict(X_valid)
result = mse(y_valid, predictions_valid_ereg) ** 0.5 # calculate RMSE on validation set
print()
print("voting regressor model on the valid set: ", result)
The voting regressor combines the three best-performing models and their tuned parameters: XG Boost, Light GBM, and CatBoost. It achieves an RMSE of 30.78 on the validation set, close behind CatBoost alone (30.42).
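Under the hood, a `VotingRegressor` with default weights simply averages the predictions of its fitted base estimators. A small sketch with toy data and two simple base models (sklearn only, chosen for brevity; the names and data are illustrative) makes that explicit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import VotingRegressor

# toy regression data: 100 samples, 3 features, noisy linear target
rng = np.random.RandomState(19)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

lr = LinearRegression()
dt = DecisionTreeRegressor(random_state=19, max_depth=3)
vote = VotingRegressor([('lr', lr), ('dt', dt)]).fit(X, y)

# the ensemble prediction equals the unweighted mean of the base predictions
manual = np.mean([est.predict(X) for est in vote.estimators_], axis=0)
print(np.allclose(vote.predict(X), manual))  # True
```

This averaging is why the ensemble lands between its strongest and weakest members rather than strictly beating all of them.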
# making model scores dataframe
model_scores = pd.DataFrame({'Linear Regression': 34.28, 'Decision Tree': 38.54, 'Random Forest': 32.01, 'Ada Boost': 34.89, 'Gradient Boost': 35.11, 'XG Boost': 31.69, 'Light GBM': 31.51,
                             'Catboost': 30.42, 'Voting Regressor': 30.78}, index=['RMSE'])
model_scores = model_scores.T
# Model RMSE scores
px.scatter(model_scores, title='Model RMSE Scores', template='ggplot2', color=model_scores.index, size='RMSE', y='RMSE', size_max=30, labels={'index': 'Model'})
This figure compares the RMSE scores of the models. The three best-performing individual models are XG Boost, Light GBM, and CatBoost, and the voting regressor built from them also achieves a strong score.
# combine validation set with training set
X_full = pd.concat([X_train, X_valid])
y_full = pd.concat([y_train, y_valid])
# Training regressors
# reg1 = RandomForestRegressor(random_state=12345, n_estimators=90, max_depth=100)
# reg2 = GradientBoostingRegressor(random_state=19, learning_rate=0.2, n_estimators=1000, verbose=1, max_depth=3)
reg3 = XGBRegressor(learning_rate=0.09, n_estimators=800, eval_metric='rmse', random_state=19, max_depth=6) #, early_stopping_rounds=500)
reg4 = lgb.LGBMRegressor(objective='regression', metric='root_mean_squared_error', boosting_type='gbdt', random_state=19) #, early_stopping_rounds=50)
reg5 = CatBoostRegressor(task_type='GPU', loss_function='RMSE', eval_metric='RMSE', iterations=2000, random_seed=19) #, early_stopping_rounds=500)
final = VotingRegressor(estimators=[#('rf', reg1),
#('gbr', reg2),
('xgb', reg3),
('lgb', reg4),
('cat', reg5)],
verbose=1)
final = final.fit(X_full, y_full)
# Make predictions on the test set
final_predictions = final.predict(X_test)
result = mse(y_test, final_predictions) ** 0.5 # calculate RMSE on the test set
print()
print("voting regressor model on the test set: ", result)
The final model we chose is the voting regressor, refit on the combined training and validation data. It achieved an RMSE of 41.24 on the test set, well under the required threshold of 48, so it predicts the future number of orders with the required accuracy.
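For reference, the RMSE metric used throughout this comparison is just the square root of the mean squared error; a minimal check with toy numbers (illustrative only) confirms the computation pattern used in the cells above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])
# errors are 2, 2, 3 -> mean squared error = (4 + 4 + 9) / 3
rmse = mean_squared_error(y_true, y_pred) ** 0.5
print(round(rmse, 4))  # 2.3805
```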
Overall, we succeeded in building a model for Sweet Lift Taxi that predicts the number of orders in the next hour. The target metric was an RMSE under 48; our final voting regressor achieved 46.47 on the test data set. Sweet Lift can therefore use the model's forecasts to allocate more drivers for peak hours.